Linear Mixed Models (LMMs) - Moderation & Comparisons

Joshua F. Wiley / Michelle L. Byrne

2025-05-09

These are the R packages we will use.

options(digits = 4)

## emmeans is a new package

library(data.table)
library(JWileymisc)
library(extraoperators)
library(lme4)
## Loading required package: Matrix
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
library(haven)
library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
## load data collection exercise data
## merged is a merged long dataset of baseline and daily data
dm <- as.data.table(read_sav("[2021] PSY4210 merged.sav"))

## Remind R which of our variables are factors

dm[, sex := factor(
  sex,
  levels = c(1,2),
  labels = c("male", "female"))]

dm[, relsta := factor(
  relsta, levels = c(1,2,3), 
  labels = c("single", "in a committed exclusive relationship", "in a committed nonexclusive relationship"))]

1 LMM Notation

Let’s consider the formula for a relatively simple LMM:

\[ y_{ij} = b_{0j} + b_1 * x_{1j} + b_2 * x_{2ij} + \varepsilon_{ij} \]

Here, as before, the i indicates the ith observation for a specific unit and the j indicates the jth unit. In psychology the unit is usually a person, but it could also be a classroom, a doctor, etc.

Regression coefficients, the \(b\)s, with a j subscript indicate both a fixed and a random effect. That is, the coefficient is allowed to vary across the units, j. As before, in practice these coefficients are decomposed into a fixed and a random part:

\[ b_{0j} = \gamma_{00} + u_{0j} \]

and in our LMM we estimate the fixed-effect part, \(\gamma_{00}\), and the variance / standard deviation of the random effect (or, if there are multiple random effects, their covariance matrix, \(\mathbf{G}\)):

\[ u_{0j} \sim \mathcal{N}(0, \mathbf{G}) \]

Regression coefficients without a j subscript indicate fixed-only effects: effects that do not vary across units, j. These are estimated directly.

Predictors / explanatory variables, the \(x\)s, with an i subscript vary within a unit. Note that the outcome, \(y\), must vary within units to be used in a LMM.

In this case, the notation tells us the following:

- \(b_{0j}\) is both a fixed and a random effect: the intercept varies across units.
- \(b_1\) is a fixed-only effect of \(x_{1j}\), a predictor that varies only between units.
- \(b_2\) is a fixed-only effect of \(x_{2ij}\), a predictor that varies within units.

The following decision tree provides some guide to when a predictor / explanatory variable can be a fixed and random effect.

[Figure: Type of effect decision tree]

Let’s see two examples of putting this basic model into practice.

\[ energy_{ij} = b_{0j} + b_1 * loneliness_{j} + b_2 * stress_{ij} + \varepsilon_{ij} \]

The corresponding R code is:

summary(lmer(dEnergy ~ loneliness + dStress + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ loneliness + dStress + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 950.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6560 -0.6269 -0.0158  0.7592  2.5046 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.401    0.633   
##  Residual             1.354    1.163   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)   5.5473     0.4162 105.2638   13.33   <2e-16 ***
## loneliness   -0.4092     0.1732  84.4459   -2.36    0.020 *  
## dStress      -0.1621     0.0597 270.1607   -2.72    0.007 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) lnlnss
## loneliness -0.778       
## dStress    -0.439 -0.171

Here is another example decomposing stress into a between and within component.

\[ energy_{ij} = b_{0j} + b_1 * Bstress_{j} + b_2 * Wstress_{ij} + \varepsilon_{ij} \]

dm[, c("Bstress", "Wstress") := meanDeviations(dStress), by = ID]

summary(lmer(dEnergy ~ Bstress + Wstress + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ Bstress + Wstress + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 955.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7137 -0.6580 -0.0296  0.7295  2.4510 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.435    0.66    
##  Residual             1.361    1.17    
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)   5.0538     0.4016  89.5608   12.58   <2e-16 ***
## Bstress      -0.2522     0.0945  88.0221   -2.67    0.009 ** 
## Wstress      -0.1402     0.0764 206.0826   -1.84    0.068 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr) Bstrss
## Bstress -0.967       
## Wstress  0.000  0.000
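For reference, meanDeviations() from JWileymisc returns the person mean and the deviations from it; the same split can be sketched in plain data.table code (Bstress2 and Wstress2 are illustrative names, not used elsewhere in this document):

```r
## between-person component: each person's mean daily stress
dm[, Bstress2 := mean(dStress, na.rm = TRUE), by = ID]
## within-person component: each day's deviation from the person mean
dm[, Wstress2 := dStress - Bstress2]
## the two components add back up to the original daily stress scores
dm[, all.equal(Bstress2 + Wstress2, dStress)]
```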

We can make more effects random effects. For example, taking our earlier example and just changing \(b_2\) into \(b_{2j}\):

\[ energy_{ij} = b_{0j} + b_1 * loneliness_{j} + b_{2j} * stress_{ij} + \varepsilon_{ij} \]

The corresponding R code is:

summary(lmer(dEnergy ~ loneliness + dStress + (dStress | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ loneliness + dStress + (dStress | ID)
##    Data: dm
## 
## REML criterion at convergence: 947.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6247 -0.6572 -0.0176  0.7130  2.5089 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  ID       (Intercept) 1.3303   1.153         
##           dStress     0.0612   0.247    -0.87
##  Residual             1.2957   1.138         
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   5.3885     0.4364 89.1320   12.35   <2e-16 ***
## loneliness   -0.3678     0.1692 83.2661   -2.17    0.033 *  
## dStress      -0.1456     0.0678 60.1490   -2.15    0.036 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) lnlnss
## loneliness -0.738       
## dStress    -0.531 -0.134

Now, with two random effects, we assume that the random effects, \(u_{0j}\) and \(u_{2j}\), which we collectively denote \(\mathbf{u}_{j}\), follow a multivariate normal distribution with covariance matrix \(\mathbf{G}\).

\[ \mathbf{u}_{j} \sim \mathcal{N}(0, \mathbf{G}) \]
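In lme4, the estimated \(\mathbf{G}\) can be inspected with VarCorr() on the fitted model. A sketch, assuming we store the random intercept and slope model from above in an object (m2 is an illustrative name):

```r
## refit the random intercept + random slope model and keep the object
m2 <- lmer(dEnergy ~ loneliness + dStress + (dStress | ID), data = dm)
vc <- VarCorr(m2)
print(vc)   ## standard deviations and the intercept-slope correlation
vc$ID       ## G as a 2 x 2 covariance matrix for (Intercept, dStress)
```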

Based on the little decision chart, between-unit-only variables, like \(loneliness_j\) and \(Bstress_j\), cannot be random effects. In the data collection exercise, we measured loneliness at baseline and also in the daily diary questionnaires. In this example we are using the baseline (trait) loneliness, not the daily one. Also, while it is technically possible for a variable to be only a random effect without a corresponding fixed effect, it's not common and not recommended, as it would be equivalent to assuming that the fixed effect, the mean of the distribution, is 0, which is rarely appropriate.

2 Interactions in LMMs

Interactions in LMMs work effectively the same way that interactions in GLMs do, although there are a few nuances in options and possible interpretations. Using the notation from above, let’s consider a few different possible interactions.

2.1 Cross Level (Between and Within Unit) Interactions

First, let’s take our model with loneliness and stress and include an interaction. Here is the model without an interaction.

\[ energy_{ij} = b_{0j} + b_1 * loneliness_{j} + b_2 * stress_{ij} + \varepsilon_{ij} \]

The corresponding R code is:

summary(lmer(dEnergy ~ loneliness + dStress + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ loneliness + dStress + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 950.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6560 -0.6269 -0.0158  0.7592  2.5046 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.401    0.633   
##  Residual             1.354    1.163   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)   5.5473     0.4162 105.2638   13.33   <2e-16 ***
## loneliness   -0.4092     0.1732  84.4459   -2.36    0.020 *  
## dStress      -0.1621     0.0597 270.1607   -2.72    0.007 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) lnlnss
## loneliness -0.778       
## dStress    -0.439 -0.171

Now let’s add the interaction, as a fixed effect.

\[ energy_{ij} = b_{0j} + b_1 * loneliness_{j} + b_2 * stress_{ij} + b_3 * (loneliness_{j} * stress_{ij}) + \varepsilon_{ij} \]

The corresponding R code is:

## long way
summary(lmer(dEnergy ~ loneliness + dStress + loneliness:dStress + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ loneliness + dStress + loneliness:dStress + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 952.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6070 -0.6090 -0.0285  0.7734  2.4928 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.40     0.632   
##  Residual             1.36     1.165   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##                    Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)          6.0085     0.9593 203.5738    6.26  2.2e-09 ***
## loneliness          -0.6410     0.4681 228.2603   -1.37     0.17    
## dStress             -0.2747     0.2193 228.3756   -1.25     0.21    
## loneliness:dStress   0.0555     0.1041 247.7271    0.53     0.59    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) lnlnss dStrss
## loneliness  -0.962              
## dStress     -0.919  0.877       
## lnlnss:dStr  0.901 -0.929 -0.962
## shorthand in R for simple main effects + interaction;
## identical to the above, but shorter
summary(lmer(dEnergy ~ loneliness * dStress + (1 | ID), data = dm))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ loneliness * dStress + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 952.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6070 -0.6090 -0.0285  0.7734  2.4928 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.40     0.632   
##  Residual             1.36     1.165   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##                    Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)          6.0085     0.9593 203.5738    6.26  2.2e-09 ***
## loneliness          -0.6410     0.4681 228.2603   -1.37     0.17    
## dStress             -0.2747     0.2193 228.3756   -1.25     0.21    
## loneliness:dStress   0.0555     0.1041 247.7271    0.53     0.59    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) lnlnss dStrss
## loneliness  -0.962              
## dStress     -0.919  0.877       
## lnlnss:dStr  0.901 -0.929 -0.962

The relevant new part is the interaction term, \(b_3\), a fixed effect in this case. If we focus on just that term, we see that the coefficient, \(b_3\), is applied to the arithmetic product of two variables, here loneliness and stress. As it happens, one of them, loneliness, varies between units whereas the other, stress, varies within units. You will sometimes see this termed a “cross level” interaction because it involves a between-unit and a within-unit variable.

\[ b_3 * (loneliness_{j} * stress_{ij}) \]

As with interactions in regular GLMs, interactions in LMMs can be interpreted in different ways. The two common interpretations are easiest to see by factoring the regression equation. Here are three equivalent equations that highlight different ways of viewing the interaction; the latter two show how the simple effect of stress varies by loneliness and how the simple effect of loneliness varies by stress.

\[ \begin{align} energy_{ij} &= b_{0j} + b_1 * loneliness_{j} + b_2 * stress_{ij} + b_3 * (loneliness_{j} * stress_{ij}) + \varepsilon_{ij} \\ &= b_{0j} + b_1 * loneliness_{j} + (b_2 + b_3 * loneliness_j) * stress_{ij} + \varepsilon_{ij} \\ &= b_{0j} + (b_1 + b_3 * stress_{ij}) * loneliness_{j} + b_2 * stress_{ij} + \varepsilon_{ij} \\ \end{align} \]

The nuance in LMMs comes in because some variables vary only between units and others within units. For example, when interpreting the interaction with respect to the simple effect of stress, we could say that the association between daily stress and energy on the same day depends on the loneliness of a participant. Conversely, when interpreting with respect to the simple effect of loneliness, we could say that the association of participant loneliness and average energy depends on how stressed someone is feeling on a given day. Loneliness varies between people and stress varies within people, so that must be taken into account in the interpretation.
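The second factoring above also shows how to compute a simple slope of stress by hand from the fixed effects: it is \(b_2 + b_3 \times loneliness_j\). A sketch, assuming the interaction model is stored in an object (mint is an illustrative name):

```r
## store the cross-level interaction model
mint <- lmer(dEnergy ~ loneliness * dStress + (1 | ID), data = dm)
b <- fixef(mint)
## simple slope of daily stress at loneliness values of 1 and 3
b["dStress"] + b["loneliness:dStress"] * c(1, 3)
```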

2.2 Between Unit Interactions

The same approach works with other types of variables in LMMs. For example, here is a model with loneliness and sex as predictors. Both vary only between units.

\[ \begin{align} energy_{ij} &= b_{0j} + b_1 * loneliness_{j} + b_2 * sex_{j} + b_3 * (loneliness_{j} * sex_{j}) + \varepsilon_{ij} \\ &= b_{0j} + b_1 * loneliness_{j} + (b_2 + b_3 * loneliness_j) * sex_{j} + \varepsilon_{ij} \\ &= b_{0j} + (b_1 + b_3 * sex_{j}) * loneliness_{j} + b_2 * sex_{j} + \varepsilon_{ij} \\ \end{align} \]

When interpreting the interaction with respect to the simple effect of sex, we could say that the association between participant sex and average energy depends on the loneliness of a participant. Conversely, when interpreting with respect to the simple effect of loneliness, we could say that the association of participant loneliness and average energy depends on the participant’s sex.
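A sketch of the corresponding R code, using the sex factor recoded at the start of this document (output omitted):

```r
## between x between interaction: loneliness and sex both vary only between people
summary(lmer(dEnergy ~ loneliness * sex + (1 | ID), data = dm))
```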

2.3 Within Unit Interactions

Finally, both variables could vary within units.

\[ \begin{align} energy_{ij} &= b_{0j} + b_1 * selfesteem_{ij} + b_2 * stress_{ij} + b_3 * (selfesteem_{ij} * stress_{ij}) + \varepsilon_{ij} \\ &= b_{0j} + b_1 * selfesteem_{ij} + (b_2 + b_3 * selfesteem_{ij}) * stress_{ij} + \varepsilon_{ij} \\ &= b_{0j} + (b_1 + b_3 * stress_{ij}) * selfesteem_{ij} + b_2 * stress_{ij} + \varepsilon_{ij} \\ \end{align} \]

When interpreting the interaction with respect to the simple effect of stress, we could say that the association between daily stress and energy on the same day depends on same-day self-esteem. Conversely, when interpreting with respect to the simple effect of self-esteem, we could say that the association of daily self-esteem and same-day energy depends on how stressed someone is feeling on a given day.
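A sketch of the corresponding R code, assuming dSE in the merged dataset is the daily self-esteem rating (as elsewhere in this document, d-prefixed variables are daily measures):

```r
## within x within interaction: both predictors vary day to day
summary(lmer(dEnergy ~ dSE * dStress + (1 | ID), data = dm))
```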

3 Continuous Interactions in R

Aside from the notes about some minor interpretation differences, in general interactions in LMMs are analysed, graphed, and interpreted the same way as for GLMs.

First, to avoid any issues around diagnostics etc. arising from haven's labelled data types, we convert the variables we are going to work with to numeric. Then we fit a LMM with an interaction between stress and neuroticism, with energy as the outcome and a random intercept as the only random effect.

dm[, dSE := as.numeric(dSE)]
dm[, dMood := as.numeric(dMood)]
dm[, dEnergy := as.numeric(dEnergy)]
dm[, dStress := as.numeric(dStress)]
dm[, neuroticism := as.numeric(neuroticism)]

m <- lmer(dEnergy ~ neuroticism * dStress + (1 | ID), data = dm)

A quick check of the model diagnostics suggests that the data look fairly good. The intercepts do not appear to follow a normal distribution that closely, partly due to the long left tail, but for now we will leave it. If you’re interested in how to fix left skews, please see the bonus material at the end of this markdown (optional).

plot(modelDiagnostics(m), nrow = 2, ncol = 2, ask = FALSE)

No extreme values are present. If there were, we could remove them, as discussed in depth in previous lectures, update the model, and re-run the diagnostics. In practice this can take a few rounds of extreme value removal, or you may decide to stop after one round.
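That removal-and-refit cycle can be sketched as follows. This assumes the modelDiagnostics() object stores flagged rows in an extremeValues element with an Index column giving their row numbers; verify the structure with str() in your installed version of multilevelTools before relying on it:

```r
md <- modelDiagnostics(m)
## rows flagged as extreme (assumed structure; check str(md))
md$extremeValues
## if any are flagged: drop them, refit, and re-check the diagnostics
## m2 <- lmer(dEnergy ~ neuroticism * dStress + (1 | ID),
##            data = dm[-unique(md$extremeValues$Index)])
## plot(modelDiagnostics(m2), nrow = 2, ncol = 2, ask = FALSE)
```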

Let’s look at the summary of our model, m. Although we have used summary() a lot in the past, we’ll introduce another function to help examine lmer() model results: modelTest(). In this lecture we will only learn and interpret part of its output; the rest of the output from modelTest() will be covered later. In addition, to get nicely formatted results rather than a set of datasets containing the results, we use the APAStyler() function.

APAStyler(modelTest(m))
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
##                            Term                     Est           Type
##                          <char>                  <char>         <char>
##  1:                 (Intercept)  6.24*** [ 4.93,  7.56]  Fixed Effects
##  2:                     dStress   -0.37* [-0.69, -0.04]  Fixed Effects
##  3:                 neuroticism   -0.49* [-0.88, -0.10]  Fixed Effects
##  4:         neuroticism:dStress     0.07 [-0.03,  0.16]  Fixed Effects
##  5:           sd_(Intercept)|ID                    0.61 Random Effects
##  6:                       sigma                    1.17 Random Effects
##  7:                    Model DF                       6  Overall Model
##  8:                  N (Groups)                 ID (88)  Overall Model
##  9:            N (Observations)                     283  Overall Model
## 10:                      logLik                 -469.35  Overall Model
## 11:                         AIC                  950.70  Overall Model
## 12:                         BIC                  972.57  Overall Model
## 13:                 Marginal R2                    0.09  Overall Model
## 14:                 Marginal F2                    0.10  Overall Model
## 15:              Conditional R2                    0.27  Overall Model
## 16:              Conditional F2                    0.37  Overall Model
## 17:         neuroticism (Fixed)     0.04/0.01, p = .015   Effect Sizes
## 18:             dStress (Fixed)     0.02/0.00, p = .027   Effect Sizes
## 19: neuroticism:dStress (Fixed)     0.01/0.00, p = .160   Effect Sizes

The results show the regression coefficients, asterisks for p-values, and 95% confidence intervals in brackets for the fixed effects; the standard deviations of the random effects; the model degrees of freedom, which is how many parameters were estimated in the model in total; and the number of people and observations. For now, we will ignore all the output below row 9, N (Observations). For this model we can see the following.

A LMM was fit with 283 observations from 88 people. There was no significant neuroticism x stress interaction, b [95% CI] = 0.07 [-0.03, 0.16], p = .160.

We can also pass multiple model results in a list together, which puts the results side by side. This is particularly helpful for comparing models with and without covariates, to evaluate whether removing extreme values changed the results substantially, or to compare models with different outcomes.

# Remember: m <- lmer(dEnergy ~ neuroticism * dStress + (1 | ID), data = dm)

# Here's a new model with mood as the outcome to compare:

mtest1 <- lmer(dMood ~ neuroticism * dStress + (1 | ID), data = dm)

APAStyler(list(
  Energy = modelTest(m),
  Mood = modelTest(mtest1) ))
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
##                            Term                  Energy                    Mood
##                          <char>                  <char>                  <char>
##  1:                 (Intercept)  6.24*** [ 4.93,  7.56]  7.07*** [ 6.05,  8.08]
##  2:                     dStress   -0.37* [-0.69, -0.04]  -0.42** [-0.67, -0.17]
##  3:                 neuroticism   -0.49* [-0.88, -0.10]    -0.17 [-0.47,  0.14]
##  4:         neuroticism:dStress     0.07 [-0.03,  0.16]     0.00 [-0.07,  0.07]
##  5:           sd_(Intercept)|ID                    0.61                    0.43
##  6:                       sigma                    1.17                    0.93
##  7:                    Model DF                       6                       6
##  8:                  N (Groups)                 ID (88)                 ID (88)
##  9:            N (Observations)                     283                     283
## 10:                      logLik                 -469.35                 -402.04
## 11:                         AIC                  950.70                  816.09
## 12:                         BIC                  972.57                  837.96
## 13:                 Marginal R2                    0.09                    0.29
## 14:                 Marginal F2                    0.10                    0.41
## 15:              Conditional R2                    0.27                    0.41
## 16:              Conditional F2                    0.37                    0.69
## 17:         neuroticism (Fixed)     0.04/0.01, p = .015     0.01/0.01, p = .282
## 18:             dStress (Fixed)     0.02/0.00, p = .027     0.05/0.02, p = .001
## 19: neuroticism:dStress (Fixed)     0.01/0.00, p = .160     0.00/0.00, p = .969
##               Type
##             <char>
##  1:  Fixed Effects
##  2:  Fixed Effects
##  3:  Fixed Effects
##  4:  Fixed Effects
##  5: Random Effects
##  6: Random Effects
##  7:  Overall Model
##  8:  Overall Model
##  9:  Overall Model
## 10:  Overall Model
## 11:  Overall Model
## 12:  Overall Model
## 13:  Overall Model
## 14:  Overall Model
## 15:  Overall Model
## 16:  Overall Model
## 17:   Effect Sizes
## 18:   Effect Sizes
## 19:   Effect Sizes

These results show that we get similar findings when predicting daily energy and daily mood from stress and neuroticism. The relationship between daily stress and daily mood, and between daily stress and daily energy, did not vary by individual differences in neuroticism. In this case, it would make sense to re-run these models without the interaction term to test the main effects of daily stress and neuroticism.

mmain <- lmer(dEnergy ~ neuroticism + dStress + (1 | ID), data = dm)
mtest1main <- lmer(dMood ~ neuroticism + dStress + (1 | ID), data = dm)

APAStyler(list(
  Energy = modelTest(mmain),
  Mood = modelTest(mtest1main) ))
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
##                    Term                  Energy                    Mood
##                  <char>                  <char>                  <char>
##  1:         (Intercept)  5.46*** [ 4.74,  6.17]  7.04*** [ 6.50,  7.59]
##  2:             dStress   -0.15* [-0.27, -0.03] -0.41*** [-0.51, -0.32]
##  3:         neuroticism   -0.24* [-0.42, -0.06]   -0.16* [-0.30, -0.03]
##  4:   sd_(Intercept)|ID                    0.61                    0.43
##  5:               sigma                    1.17                    0.93
##  6:            Model DF                       5                       5
##  7:          N (Groups)                 ID (88)                 ID (88)
##  8:    N (Observations)                     283                     283
##  9:              logLik                 -470.33                 -402.04
## 10:                 AIC                  950.67                  814.09
## 11:                 BIC                  968.90                  832.31
## 12:         Marginal R2                    0.08                    0.29
## 13:         Marginal F2                    0.09                    0.41
## 14:      Conditional R2                    0.27                    0.41
## 15:      Conditional F2                    0.37                    0.69
## 16: neuroticism (Fixed)     0.05/0.00, p = .009     0.04/0.02, p = .020
## 17:     dStress (Fixed)     0.02/0.01, p = .013     0.30/0.16, p < .001
##               Type
##             <char>
##  1:  Fixed Effects
##  2:  Fixed Effects
##  3:  Fixed Effects
##  4: Random Effects
##  5: Random Effects
##  6:  Overall Model
##  7:  Overall Model
##  8:  Overall Model
##  9:  Overall Model
## 10:  Overall Model
## 11:  Overall Model
## 12:  Overall Model
## 13:  Overall Model
## 14:  Overall Model
## 15:  Overall Model
## 16:   Effect Sizes
## 17:   Effect Sizes

Results are indeed similar for daily mood and energy as outcomes. There are significant negative associations between daily stress and both outcome variables, and also between neuroticism and both outcome variables.

3.1 Plotting

Typically, to plot a significant interaction, a few exemplar lines are graphed showing the slope of one variable with the outcome at different values of the moderator. As with GLMs, we can use the visreg() function. Here, we’ll use neuroticism as the moderator. A common approach to picking levels of the moderator is to use the Mean - 1 SD and Mean + 1 SD. To do that, we first need the mean and standard deviation of neuroticism, which we can get using egltable() after excluding duplicates by ID, since neuroticism only varies between units. Note that we are doing this for the sake of example only, since our interaction was not significant.

egltable(c("neuroticism"), data = dm[!duplicated(ID)])
##                     M (SD)
##         <char>      <char>
## 1: neuroticism 3.46 (1.11)
visreg(m, xvar = "dStress",
       by = "neuroticism", overlay=TRUE,
       breaks = c(3.46-1.11, 3.46+1.11),
       partial = FALSE, rug = FALSE)

The results show, in an easier to interpret way, what the positive interaction coefficient of \(b = 0.07\) means: people with higher levels of neuroticism are relatively less sensitive to the (negative) effects of stress, although at both levels of neuroticism, higher stress is associated with lower energy. Keep in mind again that we are interpreting this as though the interaction were significant, for the sake of example only.

Another common way of picking exemplar values is to use the 25th and 75th percentiles. These work particularly well for very skewed distributions, where the mean +/- 1 SD could fall outside the observed range of the data. Again we exclude duplicates by ID and then use the quantile() function to get the values: 3 and 4.5 for the 25th and 75th percentiles.

quantile(dm[!duplicated(ID), neuroticism], na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##  1.0  3.0  3.5  4.5  5.0
visreg(m, xvar = "dStress",
       by = "neuroticism", overlay=TRUE,
       breaks = c(3, 4.5),
       partial = FALSE, rug = FALSE)

3.2 Simple Effects

When working with models that have interactions, a common aid to interpretation is to test the simple effects / slopes from the model. For example, previously we graphed the association between stress and energy at M - 1 SD and M + 1 SD on neuroticism. However, although visually both lines appeared to have a negative slope, we do not know from the graph alone whether there is a significant association between stress and energy at both the low (M - 1 SD) and high (M + 1 SD) levels of neuroticism. To answer that, we need to test the simple slope of stress at specific values of neuroticism. Again, this is for demonstration only given our non-significant moderation. In practice, we would not need to compute simple slopes or effects for non-significant moderations.

Our default model does actually give us one simple slope: the simple slope of stress when \(neuroticism = 0\). However, as we can tell from the mean and standard deviation of neuroticism, 0 is far outside the plausible range of values, so the simple slope given to us by default is not very useful. We could either center neuroticism and re-run the model, which would give us a different simple slope, or use post hoc functions to calculate simple slopes.
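The centering option just mentioned can be sketched like this (neuroticismC and mc are illustrative names):

```r
## center neuroticism at its between-person mean, computed on one row per ID
dm[, neuroticismC := neuroticism - mean(neuroticism[!duplicated(ID)], na.rm = TRUE)]
mc <- lmer(dEnergy ~ neuroticismC * dStress + (1 | ID), data = dm)
## now the dStress fixed effect is the simple slope at average neuroticism
fixef(mc)["dStress"]
```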

We will use the emtrends() function from the emmeans package to test the simple slopes. This function also works with GLMs, for your reference.

The emtrends() function takes a model as its first argument, then the variable for which you want to calculate a simple slope, here stress; the at argument takes a list of specific values of the moderator; and then we tell it how we want degrees of freedom calculated (note this only applies to lmer() models). We store the results in an R object, mem, and then call summary() to get a summary table. The infer = TRUE argument is needed in summary() if you want p-values.

mem <- emtrends(m, var = "dStress",
                at = list(neuroticism = c(3.46-1.11, 3.46+1.11)),
                lmer.df = "satterthwaite")

summary(mem, infer=TRUE)
##  neuroticism dStress dStress.trend     SE  df lower.CL upper.CL t.ratio p.value
##         2.35    4.13       -0.2109 0.0743 213   -0.357  -0.0644  -2.838  0.0050
##         4.57    4.13       -0.0653 0.0858 263   -0.234   0.1036  -0.761  0.4472
## 
## Degrees-of-freedom method: satterthwaite 
## Confidence level used: 0.95

The relevant parts of the output, for us, are the dStress.trend column, which contains the simple slopes; the neuroticism column, which tells us at what values of neuroticism the simple slopes were calculated; the confidence intervals, lower.CL and upper.CL, 95% by default; and the p-value. From these results, we can see that when \(neuroticism = 2.35\) there is a significant negative association between stress and energy, but not when \(neuroticism = 4.57\).

3.3 Sample Write Up

With all of this information, we can plan out the final steps for a polished write-up of the results. First, let’s get exact p-values for all our results. We can do this through the pcontrol options to APAStyler(). We also re-print the simple slopes here.

APAStyler(modelTest(m),
  pcontrol = list(digits = 3, stars = FALSE, includeP = TRUE,
                  includeSign = TRUE, dropLeadingZero = TRUE))
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
##                            Term                          Est           Type
##                          <char>                       <char>         <char>
##  1:                 (Intercept)  6.24p < .001 [ 4.93,  7.56]  Fixed Effects
##  2:                     dStress -0.37p = .029 [-0.69, -0.04]  Fixed Effects
##  3:                 neuroticism -0.49p = .016 [-0.88, -0.10]  Fixed Effects
##  4:         neuroticism:dStress  0.07p = .166 [-0.03,  0.16]  Fixed Effects
##  5:           sd_(Intercept)|ID                         0.61 Random Effects
##  6:                       sigma                         1.17 Random Effects
##  7:                    Model DF                            6  Overall Model
##  8:                  N (Groups)                      ID (88)  Overall Model
##  9:            N (Observations)                          283  Overall Model
## 10:                      logLik                      -469.35  Overall Model
## 11:                         AIC                       950.70  Overall Model
## 12:                         BIC                       972.57  Overall Model
## 13:                 Marginal R2                         0.09  Overall Model
## 14:                 Marginal F2                         0.10  Overall Model
## 15:              Conditional R2                         0.27  Overall Model
## 16:              Conditional F2                         0.37  Overall Model
## 17:         neuroticism (Fixed)          0.04/0.01, p = .015   Effect Sizes
## 18:             dStress (Fixed)          0.02/0.00, p = .027   Effect Sizes
## 19: neuroticism:dStress (Fixed)          0.01/0.00, p = .160   Effect Sizes
summary(mem, infer=TRUE)
##  neuroticism dStress dStress.trend     SE  df lower.CL upper.CL t.ratio p.value
##         2.35    4.13       -0.2109 0.0743 213   -0.357  -0.0644  -2.838  0.0050
##         4.57    4.13       -0.0653 0.0858 263   -0.234   0.1036  -0.761  0.4472
## 
## Degrees-of-freedom method: satterthwaite 
## Confidence level used: 0.95

Now we will make a polished, finalized figure. I have customized the colours and turned off the legends. In place of legends, I have manually added text annotations giving the simple slopes with their confidence intervals and p-values. (For your reference, it took about eight rounds of trial and error with different x and y values and angles to get the text to line up about right; I did not magically know which values would produce a graph I liked. That is why I think it is sometimes easier to add this sort of text after the fact in your slides or papers rather than building it into the code.)

visreg(m, xvar = "dStress",
       by = "neuroticism", overlay = TRUE,
       breaks = c(3.46-1.11, 3.46+1.11),
       partial = FALSE, rug = FALSE, gg = TRUE,
       xlab = "Daily Stress",
       ylab = "Predicted Daily Energy") +
  scale_color_manual(values = c("2.35" = "black", "4.57" = "grey70")) +
  theme_pubr() +
  guides(colour = "none", fill = "none") +
  annotate(geom = "text", x = 3.2, y = 3.9, label = "High Neuroticism: b = -0.07 [-0.23, 0.10], p = .447",
           angle = -9) + 
  annotate(geom = "text", x = 4, y = 4.4, label = "Low Neuroticism: b = -0.21 [-0.36, -0.06], p = .005",
           angle = -22)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

A linear mixed model estimated using restricted maximum likelihood was used to test whether the association of daily stress with daily energy is moderated by baseline neuroticism scores. All predictors were included as fixed effects, and a random intercept by participant was included. Visual diagnostics showed that energy was normally distributed and that no outliers were present.

The daily stress x neuroticism interaction was not statistically significant, which indicated that the relationship between stress and energy did not vary by neuroticism. Results from the analysis with the interaction dropped revealed that both neuroticism and daily stress were negatively associated with daily energy.

4 Continuous x Categorical Interactions in R

Continuous x Categorical interactions are conducted much as continuous x continuous interactions. Typically with continuous x categorical interactions, simple slopes for the continuous variable are calculated at all levels of the categorical variable.

Let’s illustrate this with a model examining the relationship between daily energy levels and self-esteem with sex as a moderator.

mconcat <- lmer(dSE ~ dEnergy*sex + (1| ID), data = dm) 

The model diagnostics look relatively good, albeit not perfect.

plot(modelDiagnostics(mconcat), nrow = 2, ncol = 2, ask = FALSE)

With reasonable diagnostics, we can look at a summary. There is one extreme residual but I’m choosing to leave it in the dataset.

summary(mconcat)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dSE ~ dEnergy * sex + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 771.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0275 -0.5344  0.0658  0.5515  2.6785 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.489    0.699   
##  Residual             0.607    0.779   
## Number of obs: 282, groups:  ID, 88
## 
## Fixed effects:
##                   Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)          2.235      0.564 269.175    3.96  9.4e-05 ***
## dEnergy              0.619      0.120 273.898    5.16  4.9e-07 ***
## sexfemale            0.878      0.600 268.799    1.46    0.144    
## dEnergy:sexfemale   -0.274      0.128 272.789   -2.13    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) dEnrgy sexfml
## dEnergy     -0.907              
## sexfemale   -0.940  0.853       
## dEnrgy:sxfm  0.850 -0.937 -0.904

Factor variables in interactions do not currently work with modelTest(), so if we wanted to use it, we’d need to manually dummy code the categorical variable. The results are identical.

dm[, female := as.integer(sex == "female")]

malt <- lmer(dSE ~ dEnergy * female + (1 | ID), data = dm)

APAStyler(modelTest(malt))
## Parameters and CIs are based on REML, 
## but modelTests requires ML not REML fit for comparisons, 
## and these are used in effect sizes. Refitting.
##                       Term                     Est           Type
##                     <char>                  <char>         <char>
##  1:            (Intercept)  2.23*** [ 1.13,  3.34]  Fixed Effects
##  2:                dEnergy  0.62*** [ 0.38,  0.85]  Fixed Effects
##  3:         dEnergy:female   -0.27* [-0.53, -0.02]  Fixed Effects
##  4:                 female     0.88 [-0.30,  2.05]  Fixed Effects
##  5:      sd_(Intercept)|ID                    0.70 Random Effects
##  6:                  sigma                    0.78 Random Effects
##  7:               Model DF                       6  Overall Model
##  8:             N (Groups)                 ID (88)  Overall Model
##  9:       N (Observations)                     282  Overall Model
## 10:                 logLik                 -380.57  Overall Model
## 11:                    AIC                  773.14  Overall Model
## 12:                    BIC                  794.99  Overall Model
## 13:            Marginal R2                    0.23  Overall Model
## 14:            Marginal F2                    0.29  Overall Model
## 15:         Conditional R2                    0.56  Overall Model
## 16:         Conditional F2                    1.30  Overall Model
## 17:        dEnergy (Fixed)     0.09/0.15, p < .001   Effect Sizes
## 18:         female (Fixed)     0.00/0.01, p = .144   Effect Sizes
## 19: dEnergy:female (Fixed)     0.02/0.05, p = .035   Effect Sizes

4.1 Plotting

With continuous x categorical interactions, the easiest approach is to plot the simple slope of the continuous variable by the categorical one as shown in the following.

visreg(mconcat, xvar = "dEnergy",
       by = "sex", overlay=TRUE,
       partial = FALSE, rug = FALSE)

4.2 Simple Effects

When working with models that have interactions, a common aid to interpretation is to test the simple effects / slopes from the model. For example, previously we graphed the association between daily energy and self-esteem at each level of the categorical sex variable, i.e. for men and women. However, we cannot tell from the graph whether daily energy is significantly associated with self-esteem for men or women.

To answer that, we need to test the simple slope of energy at each of the two sex levels. Our default model does actually give us one simple slope: the simple slope of energy for men (i.e., when female = 0), but we might want more.

We will use the emtrends() function from the emmeans package to test the simple slopes.

The emtrends() function takes a model as its first argument, then the variable you want simple slopes for (here, dEnergy); the at argument takes a list of specific values of the moderator; and lmer.df tells it how to calculate the degrees of freedom (note this only applies to lmer models). We store the results in an R object, mem, and then call summary() to get a summary table. The infer = TRUE argument is needed in summary() if you want p-values.

mem <- emtrends(mconcat, var = "dEnergy",
                at = list(sex = c("male", "female")),
                lmer.df = "satterthwaite")

summary(mem, infer=TRUE)
##  dEnergy sex    dEnergy.trend     SE  df lower.CL upper.CL t.ratio p.value
##        4 male           0.619 0.1200 274    0.383    0.856   5.155  <.0001
##        4 female         0.346 0.0449 261    0.257    0.434   7.689  <.0001
## 
## Degrees-of-freedom method: satterthwaite 
## Confidence level used: 0.95

The relevant parts of the output, for us, are the dEnergy.trend column, which contains the simple slopes; the sex column, which tells us for which sex each simple slope was calculated; the confidence intervals (lower.CL and upper.CL; 95% by default); and the p-values. From these results, we can see that daily energy is significantly associated with self-esteem for both men and women, although the association is stronger for male participants than for female participants.
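We can also test whether the two simple slopes differ from each other by passing the emtrends() results to pairs(); this is a sketch assuming the mem object from above. The male minus female difference in slopes it tests corresponds to the dEnergy:sexfemale interaction coefficient from the model summary.

```r
## pairwise comparison of the simple slopes (male vs female)
pairs(mem)
```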

5 Categorical x Categorical Interactions in R

Categorical x Categorical interactions are conducted comparably, although more contrasts / simple effect follow-ups are possible.

Here we will work with Int_Str again, which is a two-level categorical predictor (0 = no interaction with a stranger, 1 = interacted with a stranger that day) and energy as the outcome. We also work with a three-level conscientiousness variable.

## create a categorical conscientiousness variable
dm[, cons3 := cut(conscientiousness, breaks = quantile(conscientiousness, probs = c(0, 1/3, 2/3, 1), na.rm=TRUE),
                    labels = c("Low", "Mid", "High"),
                    include.lowest = TRUE)]

mcat2 <- lmer(dEnergy ~ cons3 * Int_Str + (1 | ID), data = dm)

The model diagnostics look relatively good.

plot(modelDiagnostics(mcat2), nrow = 2, ncol = 2, ask = FALSE)

With reasonable diagnostics, we can look at a summary.

summary(mcat2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: dEnergy ~ cons3 * Int_Str + (1 | ID)
##    Data: dm
## 
## REML criterion at convergence: 943.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6025 -0.6516  0.0136  0.7113  2.2028 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.464    0.681   
##  Residual             1.315    1.147   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##                   Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)         3.6934     0.1635 118.5695   22.59  < 2e-16 ***
## cons3Mid           -0.0514     0.3064  97.6260   -0.17  0.86716    
## cons3High           0.6270     0.2645 119.6594    2.37  0.01935 *  
## Int_Str             0.7898     0.2317 266.3274    3.41  0.00075 ***
## cons3Mid:Int_Str   -0.9873     0.5050 272.8990   -1.96  0.05157 .  
## cons3High:Int_Str  -0.6233     0.3791 276.2360   -1.64  0.10128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) cns3Md cns3Hg Int_St c3M:I_
## cons3Mid    -0.534                            
## cons3High   -0.618  0.330                     
## Int_Str     -0.446  0.238  0.276              
## cns3Md:In_S  0.205 -0.375 -0.127 -0.459       
## cns3Hgh:I_S  0.273 -0.146 -0.462 -0.611  0.280

5.1 Plotting

With categorical x categorical interactions, visreg() produces OK but not great figures, as shown in the following. We can see the means of energy for all 6 cells (the cross of 3 levels of conscientiousness x 2 levels of Int_Str).

dm[, Int_Str := factor(
  Int_Str,
  levels = c(0,1),
  labels = c("No interaction with stranger", "Interacted with stranger"))]

visreg(mcat2, xvar = "Int_Str",
       by = "cons3", overlay=TRUE,
       partial = FALSE, rug = FALSE)

5.2 Simple Effects

When working with categorical x categorical interactions (or with a categorical predictor with >2 levels where you want to test various group differences), the emmeans() function from the emmeans package is helpful. We can get the means of energy by interaction-with-stranger condition and conscientiousness group, along with confidence intervals and p-values. By default, these p-values test whether each mean is different from zero.

## re-run mcat2 now with int_str as factor
mcat2 <- lmer(dEnergy ~ cons3 * Int_Str + (1 | ID), data = dm)

em <- emmeans(mcat2, "Int_Str", by = "cons3",
              lmer.df = "satterthwaite")
summary(em, infer = TRUE)
## cons3 = Low:
##  Int_Str                      emmean    SE    df lower.CL upper.CL t.ratio
##  No interaction with stranger   3.69 0.164 118.6     3.37     4.02  22.586
##  Interacted with stranger       4.48 0.216 212.4     4.06     4.91  20.770
##  p.value
##   <.0001
##   <.0001
## 
## cons3 = Mid:
##  Int_Str                      emmean    SE    df lower.CL upper.CL t.ratio
##  No interaction with stranger   3.64 0.259  90.5     3.13     4.16  14.058
##  Interacted with stranger       3.44 0.432 245.6     2.59     4.29   7.981
##  p.value
##   <.0001
##   <.0001
## 
## cons3 = High:
##  Int_Str                      emmean    SE    df lower.CL upper.CL t.ratio
##  No interaction with stranger   4.32 0.208 120.3     3.91     4.73  20.784
##  Interacted with stranger       4.49 0.273 183.0     3.95     5.03  16.446
##  p.value
##   <.0001
##   <.0001
## 
## Degrees-of-freedom method: satterthwaite 
## Confidence level used: 0.95
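The p-values above test each mean against zero. If instead you want to test whether the means differ between the two interaction conditions within each conscientiousness group, one option is to pass the emmeans() results to pairs(); a sketch assuming the em object from above:

```r
## within each cons3 group, compare the two Int_Str conditions
pairs(em)
```

Because emmeans() was called with by = "cons3", pairs() compares the Int_Str levels separately within each conscientiousness group.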

A nice plot, with confidence intervals for the fixed effects, can be obtained by using the emmip() function from the emmeans package. It takes as input the results from emmeans(), not the lmer() model results directly. Here is a simple plot showing the categorical interactions. Note that with this approach, you could basically fit the same model(s) that you would with a repeated measures or mixed effects ANOVA model, with the advantage that LMMs do not require balanced designs and allow both categorical and continuous predictors (e.g., you could include continuous covariates easily). GLMs and (G)LMMs can do everything that t-tests and various ANOVAs can, but with greater flexibility.

emmip(em, cons3~Int_Str, CIs = TRUE) +
  theme_pubr() +
  ylab("Predicted Energy")

6 Comparing Models

For many statistical models, including LMMs, it can be informative to compare different models. Model comparisons can be used in many different ways, and we will look at examples of several of them in this topic, although they are not meant to be exhaustive.

Just as there are many different kinds of models we can fit, even within LMMs (e.g., with or without random slopes), so too there are many different kinds of, and purposes for, model comparisons.

To begin with, let’s imagine we are just trying to compare two models: Model A and Model B. We can broadly classify the type of comparison based on whether Model A and B are nested or non-nested models. We will talk about what these mean next.

6.1 Nested Models

Models are considered nested when one model is a restricted or constrained version of another model. For example, suppose that Model A predicts mood from stress and loneliness whereas Model B predicts mood from loneliness only. Written as a formula, these could be:

\[ ModelA: mood_{ij} = b_{0j} + b1 * loneliness_j + b2 * stress_{ij} + \varepsilon_{ij} \]

and

\[ ModelB: mood_{ij} = b_{0j} + b1 * loneliness_j + 0 * stress_{ij} + \varepsilon_{ij} \]

In Model B, I purposely used \(0 * stress_{ij}\) to highlight that when a predictor is left out of a model, it is the same as fixing (sometimes also called constraining) the coefficient (\(b2\) in this case) to be exactly 0. In this case, we would say that Model B is nested within Model A. In other words, Model A contains every predictor and parameter in Model B plus more.


This idea is similar to the idea of nested data used in LMMs, but the difference is that we are not talking about observations or data, rather we are talking about the parameters of a model.

To summarize, briefly, we say that Model B is nested within Model A if every parameter in Model B also appears in Model A; that is, if Model B can be obtained from Model A by constraining one or more of Model A’s parameters (often to 0).

If two models are nested, then we have the most options in terms of comparing the two models. For example, we can evaluate whether Model A is a statistically significantly better fit than Model B using a Likelihood Ratio Test (LRT; https://en.wikipedia.org/wiki/Likelihood-ratio_test).

We can compare the fit of each model and use the difference in fit to derive effect sizes. We can also attribute any differences between the two models to the parameter(s) that have been constrained to 0 in Model B relative to Model A.

A simple definition of the LRT test statistic, \(\lambda\), is based on two times the difference in the log likelihoods:

\[ \lambda = -2 * (LL_B - LL_A) \]

You may wonder why the likelihood ratio test can be conducted by taking the difference in the log likelihoods. It is because the log of a ratio is the same as the difference in the logs of the numerator and denominator.

\[ log_{e}\left(\frac{6}{2}\right) = log_{e}(6) - log_{e}(2) \]

which we can confirm is true in R:

## log of ratio
log(6/2)
## [1] 1.099
## difference in logs
log(6) - log(2)
## [1] 1.099

If the null hypothesis of no difference is true in the population, then \(\lambda\) will follow a chi-squared distribution with degrees of freedom equal to the number of parameters constrained to 0 in Model B relative to Model A (i.e., the difference in the degrees of freedom used by each model), that is:

\[ \lambda \sim \chi^2(DF_A - DF_B) \]

Thus we often use a chi-square distribution in the LRT to look up the p-value. Finally, note that because LRTs are based on the log likelihoods (LL) from a model, we need true log likelihoods for the LRT to be valid. Therefore, we cannot use restricted maximum likelihood, we need to use maximum likelihood estimation.

6.1.1 Nested Models in R

To see the idea of nested models and the LRT in action, let’s examine a concrete example in R. Here are two LMMs corresponding to the Model A and Model B formulae we wrote previously. We can see in the R code that the models are nested; the only difference is dStress. We set REML = FALSE to get maximum likelihood estimates so that we have true log likelihoods. We also need to confirm that the two models are based on the same number of observations, which we can extract in R using the nobs() function. This lets us confirm that the two models are fitted to the same data. For example, if stress had missing data on rows that were not missing on loneliness or mood, Model A could be based on fewer observations than Model B.

modela <- lmer(dMood ~ loneliness + dStress + (1 | ID), data = dm, REML = FALSE)
modelb <- lmer(dMood ~ loneliness + (1 | ID), data = dm, REML = FALSE)

nobs(modela)
## [1] 283
nobs(modelb)
## [1] 283

In this case, we can see that the number of observations are identical. Now we can find the log likelihoods of both models by using the logLik() function.

logLik(modela)
## 'log Lik.' -396.8 (df=5)
logLik(modelb)
## 'log Lik.' -432.8 (df=4)

Now we have the log likelihoods (LL) and degrees of freedom of both models. From these, we can calculate \(\lambda\) and then look up its p-value on a chi-square distribution with 1 degree of freedom (1 is the difference in degrees of freedom between Models A and B). To get the p-value from a chi-square distribution, we use the pchisq() function.

## lambda
-2 * (-432.79 - -396.81)
## [1] 71.96
## p-value from a chi-square
pchisq(71.96, df = 1, lower.tail = FALSE)
## [1] 2.196e-17

In practice, we do not need to do these steps by hand; we can test two nested models in R using the anova() function (which, despite its name, is not actually conducting an analysis of variance here). We use the anova() function with test = "LRT" to get a likelihood ratio test.

anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modelb: dMood ~ loneliness + (1 | ID)
## modela: dMood ~ loneliness + dStress + (1 | ID)
##        npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## modelb    4 874 888   -433       866                        
## modela    5 804 822   -397       794    72  1     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you compare the chi-square value, degrees of freedom and p-value, you’ll see that they basically match what we calculated by hand. The small differences are due to rounding (R will use more decimals of precision whereas we only used two decimals).

In this simple case, we are only testing a single parameter, the fixed regression coefficient for stress, because that is the only parameter that differs between Model A and Model B. Thus in this case, it would be easier to rely on the t-test we get from a summary of the model.

summary(modela)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: dMood ~ loneliness + dStress + (1 | ID)
##    Data: dm
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     803.6     821.8    -396.8     793.6       278 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.935 -0.665  0.145  0.656  2.163 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 0.106    0.326   
##  Residual             0.875    0.936   
## Number of obs: 283, groups:  ID, 88
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)   7.5359     0.2862  93.3199   26.33  < 2e-16 ***
## loneliness   -0.4935     0.1167  73.2663   -4.23  6.7e-05 ***
## dStress      -0.4132     0.0446 241.5678   -9.26  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) lnlnss
## loneliness -0.740       
## dStress    -0.472 -0.197

In this instance, the LRT and the t-test yield equivalent p-values (both < 2e-16); however, this does not have to be the case. The LRT is based on slightly different theory than the t-test: the t-test uses approximate degrees of freedom for the t-distribution based on the Satterthwaite method, whereas the LRT does not directly incorporate the sample size in the same way. The two methods are asymptotically equivalent (at very large sample sizes) but can differ, particularly in smaller samples.

In practice, we wouldn’t usually use a LRT to evaluate whether a single model parameter is statistically significant. The benefit of (nested) model comparisons is that they let us test several parameters at once, so the two models being compared can be quite different.

Here are another two models, this time they differ by two predictors, energy and sex. The LRT now tests whether the addition of energy and sex results in significantly better fit for Model A than Model B.

modela <- lmer(dMood ~ loneliness + stress + dEnergy + sex + (1 | ID),
               data = dm, REML = FALSE)
modelb <- lmer(dMood ~ loneliness + stress + (1 | ID), data = dm,
               REML = FALSE)

nobs(modela)
## [1] 283
nobs(modelb)
## [1] 283
anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modelb: dMood ~ loneliness + stress + (1 | ID)
## modela: dMood ~ loneliness + stress + dEnergy + sex + (1 | ID)
##        npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## modelb    5 871 889   -430       861                        
## modela    7 771 797   -378       757   104  2     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, we can see from the significant p-value that Model A is a significantly better fit to the data than is Model B. Note that LRTs are only appropriate when the models are nested. We cannot use LRTs for non-nested models.

In nested models, the more complex model is often called the “Full” model and the simpler model the “Reduced” or “Restricted” version of the Full model.

6.2 Non-Nested Models

Models are considered non-nested when neither model is strictly a constrained version of the other. For example, suppose that Model A predicts mood from loneliness and stress, whereas Model B predicts mood from loneliness and sex. Written as formulae, these could be:

\[ ModelA: mood_{ij} = b_{0j} + b1 * loneliness_j + b2 * stress_{ij} + \varepsilon_{ij} \]

and

\[ ModelB: mood_{ij} = b_{0j} + b1 * loneliness_j + b2 * sex_j + 0 * stress_{ij} + \varepsilon_{ij} \]

Although Model B includes loneliness, which is also in Model A, and restricts stress to be 0, it adds sex, which is not in Model A. Therefore, the two models are not nested, even though both can be fit to the same observations with the same outcome. Although we can still ask R to conduct a LRT, that LRT is not valid. It is shown here to highlight that you, as the analyst, are responsible for determining whether your two models are nested and, therefore, for deciding whether a LRT is an appropriate way to evaluate whether the models are significantly different from each other.

modela <- lmer(dMood ~ loneliness + dStress + (1 | ID),
               data = dm, REML = FALSE)
modelb <- lmer(dMood ~ loneliness + sex + (1 | ID), data = dm,
               REML = FALSE)

nobs(modela)
## [1] 283
nobs(modelb)
## [1] 283
## this is NOT appropriate LRT
anova(modela, modelb, test = "LRT")
## Data: dm
## Models:
## modela: dMood ~ loneliness + dStress + (1 | ID)
## modelb: dMood ~ loneliness + sex + (1 | ID)
##        npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
## modela    5 804 822   -397       794                    
## modelb    5 876 894   -433       866     0  0

Although we cannot conduct a LRT on non nested models, it is still useful to compare the fit of non-nested models. For example, if one is a much better fit than another model, that may suggest one set of predictors is superior or can be used to evaluate competing hypotheses (e.g., Theory 1 says that stress and family history are the most important predictors of mood and Theory 2 says that age and sleep are the best predictors of mood — these two theories are competing, not nested versions of each other).

We cannot use LRTs to compare non-nested models, but we can use other measures, including performance measures such as variance explained or model accuracy, and information criteria. We will talk about information criteria next.

6.3 Information Criteria

Two common information criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

Both the AIC and BIC are calculated primarily from the log likelihood (LL) of a model. One way of thinking about these criteria is to view models as ways of approximating reality. Suppose you have two models that each generate approximations (predictions) of reality. The model whose predictions are closer to the observed data will have the higher LL. The LL can thus be used as a relative measure of model fit; it is not an absolute measure of fit. That is, we do not interpret a specific LL value as indicating “good” fit, only which of a set of (all potentially bad) models is the best.

However, there is a limitation to using the LL alone. The LL will always stay the same or increase as additional predictors / parameters are added to the model, so if we used the LL alone, out of a set of competing models, we would virtually always pick the most complex model. To address this, we need to incorporate a penalty for model complexity, so that a more complex model is only chosen over a simpler one as the “best” if it improves the LL enough.

The AIC and BIC are very similar except that they use different penalties for model complexity (technically they are derived from rather different theoretical foundations, but for practical purposes they are similar other than the complexity penalties).

A common way of defining model complexity is based on the number of estimated model parameters, \(k\). For example, consider the following simple linear regression for \(n\) different observations:

\[ \begin{align} \eta_{i} &= b_0 + b_1 * x_i\\ y_i &\sim \mathcal{N}(\eta_i, \sigma_{\varepsilon_i})\\ \end{align} \]

The parameters are:

\[ b_0, b_1, \sigma_{\varepsilon_i} \]

so \(k = 3\).
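In R, the number of parameters a fitted model estimated can be read from the df attribute of its log likelihood. For example, for the modela LMM from the nested models section (an intercept, two fixed slopes, a random intercept standard deviation, and a residual standard deviation):

```r
## k: number of estimated parameters in the model
attr(logLik(modela), "df")
## [1] 5
```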

The equations for AIC and BIC are quite easy to follow and looking at them helps understand where they are similar and different.

\[ \begin{align} AIC &= 2 * k - 2 * LL \\ BIC &= log_{e}(n) * k - 2 * LL \\ \end{align} \]

\(n\) is the number of observations included in the model and \(LL\) is the log likelihood, where higher values indicate a better fitting model. These equations highlight that the only difference between the AIC and BIC is whether the number of parameters in the model, \(k\), is multiplied by \(2\) (AIC) or by \(log_{e}(n)\) (BIC). Thus, the AIC and BIC will be identical if:

\[ \begin{align} log_{e}(n) &= 2 \\ n &= e^2 \approx 7.39 \\ \end{align} \]

If \(n < e^2 \approx 7.39\), then the BIC will have a weaker penalty based on the number of parameters, \(k\), than the AIC. If \(n > e^2 \approx 7.39\), then the BIC will have a stronger penalty. Functionally, a stronger penalty on the number of parameters means that a particular information criterion will tend to favor more parsimonious (less complex) models. Thus, for all but the tiniest of sample sizes (\(n > e^2 \approx 7.39\)), the BIC will have a stronger penalty on model complexity and so will favor relatively more parsimonious models than will the AIC.

For both AIC and BIC, the relatively better model of those compared is the model with the lower value (i.e., lower = better for both AIC and BIC).
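To make the equations concrete, we can compute the AIC and BIC by hand and compare them against R’s built-in functions; this sketch assumes the modela LMM fitted earlier with REML = FALSE.

```r
ll <- as.numeric(logLik(modela))  ## log likelihood, LL
k  <- attr(logLik(modela), "df")  ## number of estimated parameters, k
n  <- nobs(modela)                ## number of observations, n

## AIC = 2 * k - 2 * LL
2 * k - 2 * ll
## [1] 803.6
## BIC = log(n) * k - 2 * LL
log(n) * k - 2 * ll
## [1] 821.8
```

These hand calculations should match the output of the AIC() and BIC() functions.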

There is no “right” or “wrong” information criterion to use; people use the AIC, the BIC, or both. If the AIC and BIC suggest the same model is the “best”, there is no ambiguity. Sometimes, however, the AIC and BIC disagree about which model is best, and you must pick which criterion to go with. Both criteria require the number of observations, \(n\), to be larger than the number of parameters, \(k\), to operate well.

A benefit of the AIC and BIC is that both can be used to compare non-nested models as well as nested models. It is relatively common to use the AIC and/or BIC to “choose” amongst a set of possible models.

In R we can usually calculate AIC and BIC using the functions, AIC() and BIC(). Here we will calculate the AIC and BIC for our two models.

AIC(modela, modelb)
##        df   AIC
## modela  5 803.6
## modelb  5 875.6
BIC(modela, modelb)
##        df   BIC
## modela  5 821.8
## modelb  5 893.8

In this case, both the AIC and BIC are lower for Model A than for Model B, indicating that Model A is the optimal of those two models. Again, the relative interpretation is important. We cannot conclude that Model A is a “good” model, only that it is better than Model B.

6.4 Model Selection

Integrating what we have learned about LRTs for nested models and the AIC/BIC for nested or non-nested models, we can compare several models and select the best one.

First, we fit a series of models. In all cases, mood is the outcome. First we have an intercept only model (m0), an unconditional model, as a baseline reference. This is helpful as all the fit indices are relative. Then we fit polynomials of stress with degree 1 (linear; m1); degree 2 (quadratic; m2); degree 3 (cubic; m3).

We fit the different degree polynomials of stress using the poly() function in R. Finally, we fit a competing model: perhaps a linear effect of energy is a better predictor of mood than stress.

Next, we check that the observations are identical across models. If they were not, we would need to create a new dataset excluding rows with missing data on any variable used in any of the models. This check is critical because LRTs, AIC, and BIC are only valid comparisons when based on the same data.

dm[, dStress := as.numeric(dStress)]
dm[, dMood := as.numeric(dMood)]

m0 <- lmer(dMood ~ 1 + (1 | ID), data = dm, REML = FALSE)
m1 <- lmer(dMood ~ poly(dStress, 1) + (1 | ID), data = dm, REML = FALSE)
m2 <- lmer(dMood ~ poly(dStress, 2) + (1 | ID), data = dm, REML = FALSE)
m3 <- lmer(dMood ~ poly(dStress, 3) + (1 | ID), data = dm, REML = FALSE)
malt <- lmer(dMood ~ dEnergy + (1 | ID), data = dm, REML = FALSE)

## check all the observations are the same
nobs(m0)
## [1] 283
nobs(m1)
## [1] 283
nobs(m2)
## [1] 283
nobs(m3)
## [1] 283
nobs(malt)
## [1] 283
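Here all models use the same 283 observations. Had the counts differed, one option is to build a common complete-case dataset first, restricted to the variables used in any of the models (a sketch using data.table syntax):

```r
## Keep only rows complete on every variable used in any model above,
## so all models are fit to exactly the same observations
dmcc <- dm[!is.na(dMood) & !is.na(dStress) & !is.na(dEnergy) & !is.na(ID)]
```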

Now we can see which model fits best. For the nested models (m0 - m3) we can use LRTs. Technically, m0 is also nested within malt, so we can use a LRT for that comparison too. We cannot use a LRT for, say, m1 versus malt, as those models are not nested. We can, however, use AIC and BIC for all models. Note that with multiple models, the anova() function performs sequential LRTs (e.g., m0 vs m1, then m1 vs m2, etc.), not comparisons of every model against a single reference. If you want two specific models compared, specify just those two.

#### LRTs for nested models ####

## two specific comparisons
anova(m0, m3, test = "LRT")
## Data: dm
## Models:
## m0: dMood ~ 1 + (1 | ID)
## m3: dMood ~ poly(dStress, 3) + (1 | ID)
##    npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## m0    3 891 902   -442       885                        
## m3    6 820 842   -404       808    77  3     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## sequential comparisons
anova(m0, m1, m2, m3, test = "LRT")
## Data: dm
## Models:
## m0: dMood ~ 1 + (1 | ID)
## m1: dMood ~ poly(dStress, 1) + (1 | ID)
## m2: dMood ~ poly(dStress, 2) + (1 | ID)
## m3: dMood ~ poly(dStress, 3) + (1 | ID)
##    npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## m0    3 891 902   -442       885                        
## m1    4 817 832   -405       809 75.42  1     <2e-16 ***
## m2    5 819 837   -404       809  0.56  1       0.46    
## m3    6 820 842   -404       808  1.02  1       0.31    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m0, malt, test = "LRT")
## Data: dm
## Models:
## m0: dMood ~ 1 + (1 | ID)
## malt: dMood ~ dEnergy + (1 | ID)
##      npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## m0      3 891 902   -442       885                        
## malt    4 783 797   -387       775   110  1     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC and BIC for nested and non nested models
AIC(m0, m1, m2, m3, malt)
##      df   AIC
## m0    3 890.9
## m1    4 817.5
## m2    5 818.9
## m3    6 819.9
## malt  4 782.6
BIC(m0, m1, m2, m3, malt)
##      df   BIC
## m0    3 901.9
## m1    4 832.1
## m2    5 837.2
## m3    6 841.8
## malt  4 797.2

From the LRT we can see that m3 is a significantly better fit than m0. Looking at the sequential tests, however, m2 is no better than m1 and m3 is no better than m2, which suggests m1 is the best of the stress models. If we look at the AIC values for the models with stress (m1, m2, m3), AIC selects m1 as the best model (by about 1.4 compared to m2, a close call). BIC also selects m1 over m2. However, both AIC and BIC indicate that the alternate model (malt) is the best of all the models evaluated, suggesting that a linear effect of energy is a better predictor of mood than linear, quadratic, or cubic polynomials of stress.

If we were picking among the stress models only, we would pick the linear model (m1), because it is best on all indices (LRT, AIC, BIC). If we were willing to choose energy over stress, we would clearly go with energy.
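One optional way to make the AIC comparison easier to read, not shown in the output above, is to compute each model's difference from the lowest AIC; by a common rule of thumb, models within roughly 2 units of the best are treated as comparable fits:

```r
## Differences from the best (lowest) AIC across all candidate models;
## AIC() with several models returns a data.frame we can extend
aics <- AIC(m0, m1, m2, m3, malt)
aics$deltaAIC <- aics$AIC - min(aics$AIC)
aics
```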

In addition to testing fixed effects, these same methods can be used to test random effects. Here we fit a new model, m4, that includes a random slope of stress and the correlation between the random intercept and stress slope. Then we compare the linear stress model with a random slope (m4), linear stress without a random slope (m1), and the intercept-only model (m0). Please note that m4 did not converge. Normally we would simplify the model, but for demonstration purposes we keep the random slope of stress, so we must remember not to fully “trust” the results for m4.

m4 <- lmer(dMood ~ dStress + (1 + dStress | ID), data = dm, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00337365 (tol = 0.002, component 1)
## check all the observations are the same
nobs(m0)
## [1] 283
nobs(m1)
## [1] 283
nobs(m4)
## [1] 283
anova(m0, m1, test = "LRT")
## Data: dm
## Models:
## m0: dMood ~ 1 + (1 | ID)
## m1: dMood ~ poly(dStress, 1) + (1 | ID)
##    npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## m0    3 891 902   -442       885                        
## m1    4 817 832   -405       809  75.4  1     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1, m4, test = "LRT")
## Data: dm
## Models:
## m1: dMood ~ poly(dStress, 1) + (1 | ID)
## m4: dMood ~ dStress + (1 + dStress | ID)
##    npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)   
## m1    4 817 832   -405       809                       
## m4    6 812 834   -400       800  9.22  2       0.01 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m0, m4, test = "LRT")
## Data: dm
## Models:
## m0: dMood ~ 1 + (1 | ID)
## m4: dMood ~ dStress + (1 + dStress | ID)
##    npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)    
## m0    3 891 902   -442       885                        
## m4    6 812 834   -400       800  84.6  3     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m0, m1, m4)
##    df   AIC
## m0  3 890.9
## m1  4 817.5
## m4  6 812.3
BIC(m0, m1, m4)
##    df   BIC
## m0  3 901.9
## m1  4 832.1
## m4  6 834.1

From these tests we can conclude several things. First, adding a linear trend of stress as a fixed effect is significantly better than the intercept-only model. Second (had our model converged), adding a random slope of stress and the correlation of the random slope and intercept is also significantly better than the fixed linear effect of stress model. Third, the model with stress as both a fixed and random slope is significantly better than the intercept-only model (m4 vs m0 is effectively an omnibus test of adding stress as both a fixed and random slope to the model).

Finally, AIC favours m4 while BIC favours m1 over the other models. Since m4 did not converge, this suggests m1 offers the best balance of complexity and model fit (and is certainly better than m0).

The main point of this final exercise is to show how you can use LRTs and AIC and BIC to evaluate whether random effects “help” a model fit.
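As an aside, the lmerTest package (already loaded) also provides ranova(), which automates similar likelihood ratio tests for dropping random-effect terms one at a time; a quick sketch:

```r
## LRTs for removing each random-effect term from m4 (here, the random
## slope of dStress); a convenient complement to the manual anova() calls
ranova(m4)
```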

7 Summary

7.1 Conceptual

Key points to take away conceptually are:

7.2 Code

Function What it does
lmer() estimate a LMM
confint() calculate confidence intervals for a LMM
visreg() create marginal or conditional graphs from a LMM
modelDiagnostics() evaluate model diagnostics for LMMs, including multivariate normality
summary() get a summary of model results
modelTest() along with APAStyler() get a nicely formatted summary of model results, including many automatic effect sizes
emmeans() test specific means from a model
emtrends() test simple slopes from a model
AIC() calculate AIC for LMMs
BIC() calculate BIC for LMMs
anova() compare two nested LMMs and calculate a LRT
poly() fit a polynomial of specified degree for a predictor in a LMM

7.3 Extra (in case you ever want to fix left skews - optional material)

Applying transformations to left-skewed data is more difficult, as transformations generally work on long right tails. A solution is to reverse the variable, apply the transformation, and then reverse it again so the direction matches the original. We could try a square root transformation, which is milder than a log transformation. To reverse the variable, we subtract it from the sum of its minimum and maximum. Next we take the square root, then we reverse again by subtracting the result from the square root of that same constant, so that higher transformed scores still mean higher mood.

Let’s sidetrack a little and try this out with an example using a slightly skewed outcome variable, dMood.

## First let's see how the model looks like with original dMood scores

mtest1 <- lmer(dMood ~ neuroticism * dStress + (1 | ID), data = dm)

mt1 <- modelDiagnostics(mtest1) 
plot(mt1, nrow = 2, ncol = 2, ask = FALSE)

## Now let's try to fix the very mild left skew just for the sake
## of demonstration

max(dm$dMood) + min(dm$dMood)
## [1] 8
## transform
dm[, moodtrans := sqrt(8) - sqrt(8 - dMood)]

mtest <- lmer(moodtrans ~ neuroticism * dStress + (1 | ID), data = dm)

mt <- modelDiagnostics(mtest) 
plot(mt, nrow = 2, ncol = 2, ask = FALSE)

The transformation appears to have modestly helped the distribution of residuals. It is not clear whether the skew was bad enough to begin with, or whether the transformation improved things enough, to be worth the added difficulty in interpretation (dMood is now on a reverse square root scale, and that must be incorporated into its interpretation). For the lecture, we did this for demonstration purposes; in practice, consider whether it is worth it or only adds difficulty in understanding without improving / changing results much.
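If you do report results from the transformed model on the original mood scale, the transformation can be inverted (a sketch using the same min + max = 8 constant computed above):

```r
## moodtrans = sqrt(8) - sqrt(8 - dMood); solving for dMood gives:
backtransform <- function(x) 8 - (sqrt(8) - x)^2
backtransform(sqrt(8) - sqrt(8 - 5))  ## recovers a dMood score of 5
```

This is useful, for example, for translating model-implied means back into raw mood units when presenting results.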